docs: 📝 pseudo code and docstring for write_resource_parquet()
#816
base: main
Very nice!! Just some questions.
I think I've developed some confusion about what the raw data files represent. Are they different versions of the data (with later versions overwriting earlier ones) or different sections of the data (e.g. one file for rows 1-100 and another one for rows 101-200)? Well, I guess there is no reason why they couldn't be used as both...
@martonvago I forgot to respond to your initial question. Raw files are kept from the initial upload to keep a record just in case something happens. A potential scenario might be, a first round of surveys are sent to people and that data gets uploaded to Sprout. That's one raw file. Maybe a few months later, the same survey is sent out and that data gets uploaded. That's another raw file. So those two raw files get merged together and saved as the |
The overall picture of this makes sense to me as well 👍
…prout into docs/write-resource-parquet-pseudocode
@@ -127,13 +127,21 @@ flowchart
 function --> out
 ```

-### {{< var wip >}} `write_resource_parquet(raw_files, path)`
+### {{< var wip >}} `build_resource_parquet(raw_files_path, resource_properties)`
I'm unsure of the naming here. And I'm unsure if it should output a DataFrame and have another function `write_resource_parquet()` that does the writing.
Hmm, if that DataFrame output is used somewhere else, have 2 functions, otherwise have one function that does the writing as well? `build` or `create` sounds okay to me.
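To make the one-function versus two-function question concrete, here is a minimal sketch of the split. Everything in it is an assumption made for illustration: the names `build_resource_rows()` and `write_resource_rows()` are hypothetical, a plain list of dicts stands in for the DataFrame, and JSON stands in for Parquet so the sketch needs no third-party packages. It is not the Sprout implementation.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical stand-in for a DataFrame: one dict per row.
Row = dict[str, object]


def build_resource_rows(raw_batches: list[list[Row]]) -> list[Row]:
    """Combine raw batches (oldest first) into one table, without touching disk."""
    return [row for batch in raw_batches for row in batch]


def write_resource_rows(rows: list[Row], path: Path) -> Path:
    """Write already-built rows to `path` and return that path.

    JSON is used here only as a dependency-free stand-in for Parquet.
    """
    path.write_text(json.dumps(rows))
    return path


# Usage: build first, inspect or reuse the result, then write it.
rows = build_resource_rows([[{"id": 1}], [{"id": 2}]])
with tempfile.TemporaryDirectory() as folder:
    written = write_resource_rows(rows, Path(folder) / "data.json")
    round_trip = json.loads(written.read_text())
```

With this split, the built table can be checked or reused before anything is written. If nothing else ever needs the intermediate table, a single function that does both steps is simpler.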
While Sprout generally
that the files stored in the `resources/raw/` folder have already been
verified and validated, this function does some quick verification checks
of the data after reading it into Python from the raw file(s) by comparing
with the current properties given by the `resource_properties`. All data in the
-While Sprout generally
-that the files stored in the `resources/raw/` folder have already been
-verified and validated, this function does some quick verification checks
-of the data after reading it into Python from the raw file(s) by comparing
-with the current properties given by the `resource_properties`. All data in the
+While Sprout generally assumes
+that the files stored in the `resources/raw/` folder are already correctly
+structured and tidy, it still runs checks to ensure the data are correct
+by comparing to the properties. All data in the
This looks very sensible to me 😁
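As an illustration of what "comparing to the properties" could look like, here is a rough sketch of one such check. It assumes that `resource_properties` follows a Frictionless-style layout with `schema.fields[].name`, and that rows are plain dicts. The helper name `check_columns_match()` is hypothetical and does not come from the PR.

```python
def check_columns_match(
    rows: list[dict[str, object]],
    resource_properties: dict,
) -> list[str]:
    """Return a list of problems found when comparing row columns to the properties.

    Assumes `resource_properties["schema"]["fields"]` is a list of dicts
    that each have a "name" key (an assumption, not confirmed by the PR).
    """
    expected = [field["name"] for field in resource_properties["schema"]["fields"]]
    problems = []
    for index, row in enumerate(rows):
        missing = [name for name in expected if name not in row]
        extra = [name for name in row if name not in expected]
        if missing:
            problems.append(f"row {index}: missing columns {missing}")
        if extra:
            problems.append(f"row {index}: unexpected columns {extra}")
    return problems


properties = {"schema": {"fields": [{"name": "id"}, {"name": "age"}]}}
data = [{"id": 1, "age": 30}, {"id": 2, "height": 1.8}]
problems = check_columns_match(data, properties)
```

Returning the problems as a list, rather than raising on the first one, is just one possible design. It lets the caller report every mismatch at once.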
If there are any duplicate observation units in the data, only the most recent
observation unit will be kept. This way, if there are any errors or mistakes
in older raw files that has been corrected in later files, the mistake can still
-in older raw files that has been corrected in later files, the mistake can still
+in older raw files that have been corrected in later files, the mistake can still
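The "keep only the most recent observation unit" rule from the docstring might work roughly as in the sketch below. It assumes the raw batches arrive ordered oldest to newest and that an observation unit is identified by one or more key columns. The function name `keep_latest()` and the `key_columns` parameter are hypothetical.

```python
def keep_latest(
    raw_batches: list[list[dict[str, object]]],
    key_columns: list[str],
) -> list[dict[str, object]]:
    """Keep one row per observation unit, preferring rows from later batches.

    `raw_batches` must be ordered oldest first; a row in a later batch
    replaces any earlier row that has the same values in `key_columns`.
    """
    latest: dict[tuple, dict[str, object]] = {}
    for batch in raw_batches:
        for row in batch:
            key = tuple(row[column] for column in key_columns)
            latest[key] = row  # later rows overwrite earlier ones
    return list(latest.values())


first_upload = [{"id": 1, "age": 30}, {"id": 2, "age": 41}]
second_upload = [{"id": 2, "age": 14}, {"id": 3, "age": 25}]  # corrects id 2
merged = keep_latest([first_upload, second_upload], key_columns=["id"])
```

Here the corrected row for `id` 2 from the second upload replaces the earlier one, while both raw batches stay untouched as a record of what was uploaded.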
sp.write_resource_parquet(
    raw_files_path=sp.path_resources_raw_files(1, 1),
    parquet_path=sp.path_resource_data(1, 1),
    properties_path=sp.path_package_properties(1, 1),
)
Does this need to be updated?
Description
Based on @martonvago's suggestion, I'll write things in "pseudocode" from now on. Rather than true pseudocode, though, I will write an outline of the Python function showing how I think it might flow inside. Plus, I can write the full docstrings inside, so we don't need to move them over from the Quarto doc. I have NOT run this, tested it, or executed it in any way; this is purely how I think it might work, hence "pseudo" 😛. I'll add some comments directly to the code in the PR.
Closes #642
This PR needs an in-depth review.